1. Introduction

This project is an exploratory data analysis of a dataset originating from the computer game “Starcraft II” (henceforth “SC2”). SC2 belongs to the real-time strategy (RTS) genre, which typically involves players managing military bases and directly controlling armies in order to combat opposing players’ armies and destroy their bases. In SC2, players control one of three factions, each with completely different playstyles and abilities. All player actions happen in real-time, so a successful player must be able to construct buildings, gather resources, and raise an army while simultaneously controlling existing combat units. This dichotomy is referred to as macromanagement and micromanagement, or “macro” and “micro” for short. In addition, each player’s view of the battlefield is limited by the “fog of war”: each player can only see locations where his or her units and buildings are present, while the rest of the map is obscured. Thus, a successful player must juggle all of these tasks while maintaining awareness of what the enemy is doing on the battlefield. A more detailed description of the game written by the researchers who compiled this dataset can be found here.

The purpose of this study is to investigate SC2 as a tool for measuring how people learn and acquire expertise. The dataset is built from game replay files, which are a complete record of everything that happens during the course of a game, including various metrics related to the game state, what each player is looking at, and where every game unit is located at any given point in time. These games were played on the 1 vs. 1 online ladder system, which places players in one of seven “leagues” according to a “ladder rank” calculated by a matchmaking algorithm. One’s ladder rank is based on his or her win-loss record and the ladder rank of the opponent in each game. When a player wishes to play, the system attempts to find an opponent with a similar ladder rank. Once a player becomes skilled enough to consistently defeat opponents within the same league, the matchmaking system will begin matching the player against opponents of the next-highest league. If the player wins enough of these games, then he or she will be promoted to that league. Similarly, if a player is consistently losing against opponents within the same league, the matchmaking system will begin matching the player against opponents in the next-lowest league. If the player loses enough of these matches, then he or she will be demoted to that league.

Chaotic scenes such as this one are typical of almost every game of Starcraft II

What makes SC2 an interesting game to study is its sheer complexity: the number of things that could happen at any given moment, compounded across an entire game, means that every game is unique. Each game can unfold in a seemingly infinite number of different ways. The popularity of SC2 has led to a sustained professional scene, where players are sponsored by teams and compete in tournaments for cash prizes. These tournaments can take place online or offline, where players are flown into a studio or arena to compete in front of a live audience. These events are accompanied by commentators who serve a similar role to those in traditional sports. In the past, the first place prizes at the most prestigious tournaments have reached $100,000. This competitive scene has provided the game’s passionate fanbase with a clear motivation for becoming better at the game, and the professional players serve as role models for how to do so (a beginner’s guide to competitive SC2 can be found here). As a passionate member of this community during the time that this dataset was compiled, I made a serious (but short-lived) attempt to improve my SC2 skill level, during which I absorbed a variety of advice and generally accepted wisdom on how one can get better at the game, as well as a deep appreciation for the talents of professional SC2 players. With this dataset, I would like to see if these community best practices are supported by the differentiation of skill levels based on the league placement of the study participants. Throughout this project, I consulted with the friend who introduced me to this scene and who is more knowledgeable and skilled at this game than I am. We discussed the community’s opinions on the importance of various metrics, how to interpret those metrics, and whether or not they contain any nonsensical values.

At the conclusion of their paper, the researchers discussed the next study that they were embarking on, which involved accepting large numbers of replay files from each participant in order to search for changes in skill level over time. Though each data point in this dataset represents a single moment in time for one player, I want to explore how in-game metrics change across age groups and experience levels both within and across ranks. I plan on using variables in the dataset related to time to see if there are any general trends over time that can be discerned. The researchers approached this study from a cognitive science perspective, introducing a new unit of measurement called the Perception-Action Cycle (PAC), which attempts to quantify skill in terms of the amount of time elapsed between a player perceiving an event and acting upon it (a more detailed explanation of the PAC can be found here). They hypothesized that the variables that are most correlated with skill change as a player improves and moves up the ranks of the ladder and showed that machine learning methods can be used to predict a player’s rank based on such variables. I wish to avoid using variables related to PAC since I don’t believe that I have the necessary cognitive science background to understand how they were derived and what they signify. Also, I want to avoid retreading the path the researchers took, since their paper focused heavily on PAC and its relation to skill level.

2. Team

Since I am working alone on this project, I plan to explore and analyze this dataset with the following agenda:


  1. Check the self-reported variables for nonsensical/missing/outlier values and decide on a way to deal with them.
  2. Analyze the three self-reported time variables not derived from the replay files, TotalHours, HoursPerWeek, and Age.
  3. Propose an approximate order of importance among the in-game metrics included in the dataset (in terms of which skills the SC2 community deems essential for increasing skill level as opposed to what’s “nice to have”, similar to Maslow’s hierarchy of needs).
  4. Explore these variables individually in the order proposed.
  5. Visualize these variables together with the time variables to get an imprecise idea of how these metrics change over time.
  6. Create multivariate plots to find possible relationships between different metrics across different leagues and experience levels.

3. Analysis of Data Quality

The dataset used for this analysis originates from the research paper Video Game Telemetry as a Critical Tool in the Study of Complex Skill Learning (Thompson et al., 2013) and can be found here. From August 12th, 2011 to September 19th, 2011, the researchers solicited members of various online SC2 community hubs for their game replay files, which is how I found out about this study. Contributors were also asked to complete a survey that asked about their age, skill level, and playing habits. Since each replay file only contains information about what happens during the course of the one game that the file originated from, it is unable to provide players’ personal details such as age or how long they’ve been playing the game. Additionally, a player’s league isn’t included since that information doesn’t directly affect the game once the matchmaking system is done finding an opponent. The league names, in order of increasing skill, are Bronze, Silver, Gold, Platinum, Diamond, Master, and GrandMaster. Data points with a LeagueIndex value of 8 indicate replay files of professional games that were shared with the public. These replays were subsequently submitted to the study by community members, which is why they all have missing values for the self-reported variables.

It should be noted that “professional” simply means that those replays are of games that were played by professional players in tournaments; there is no “professional” league in the ladder system. Also, the GrandMaster League is limited to the top 200 players. Since success in the GrandMaster League is a prerequisite for playing SC2 professionally, professional players could in principle be represented in both leagues in this dataset. However, this is unlikely: professionals tend to hide their identities when playing online as much as possible so that their rivals don’t get opportunities to study them. It is therefore highly unlikely that a professional player participated in this survey and willingly submitted his or her own replays of GrandMaster League games played online. Due to the small sample size of professional replays in the dataset (55 of 3,395) and the lack of accompanying player information, I decided to remove these replays from the dataset.

My analysis begins with loading the CSV file containing the dataset and converting integer variables that were imported as factors. I then create a new variable called League that attaches the proper name of each player’s league based on the LeagueIndex variable. While the variables derived from the replay files are inherently accurate, the variables derived from survey answers should be checked for inaccuracies and irregularities.

library(tidyverse)

sc <- read.csv('StarCraft2ReplayAnalysis-1.csv')

#Remove professional replays

sc <- filter(sc, LeagueIndex != 8)

#Convert factors back to integers
#("NULL" string values are coerced to NA in the process)

sc$TotalHours <- as.integer(as.character(sc$TotalHours))
sc$HoursPerWeek <- as.integer(as.character(sc$HoursPerWeek))
sc$Age <- as.integer(as.character(sc$Age))

#Add a new variable "League" so that the values in the included "LeagueIndex" variable are associated with the names of each league

rank <- c('Bronze', 'Silver', 'Gold', 'Platinum', 'Diamond', 'Master', 'GrandMaster')
rank <- factor(rank, levels = c('Bronze', 'Silver', 'Gold', 'Platinum', 'Diamond', 'Master', 'GrandMaster'))
sc <- mutate(sc, League = rank[LeagueIndex])

HoursPerWeek and TotalHours

Since my analysis will partly focus on the change in player metrics as a function of time, it is appropriate to start with the two time variables, HoursPerWeek and TotalHours. These variables are prone to reporting inaccuracy because participants are attempting to condense several months of playing experience and habits into two numbers. After removing outlier values and drawing scatterplots, it also becomes obvious that the reported values for both variables have been rounded. For TotalHours, the rounding looks to have been done by the respondents, since there are data points scattered between rounded values. For HoursPerWeek, I assume that respondents could only choose from certain values preselected by the researchers rather than typing their answer manually.
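As a quick sketch of this check (not part of the original analysis): if respondents chose from preselected values, the set of distinct HoursPerWeek values should be small and evenly spaced, while TotalHours should show many in-between values.

#Distinct reported values for each variable; evenly spaced values
#suggest preselected survey choices, scattered values suggest free entry

sort(unique(sc$HoursPerWeek))
sort(unique(sc$TotalHours))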

ggplot(sc, aes(sc$HoursPerWeek)) + geom_density(bw = 2.5, fill = 'purple', alpha = .5) + labs(x = 'Reported Hours Played Per Week', title = 'HoursPerWeek (Original)') + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray'))

summary(sc$HoursPerWeek)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    8.00   12.00   15.91   20.00  168.00       1
summary(sc$TotalHours)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##       3.0     300.0     500.0     960.4     800.0 1000000.0         2

What immediately stands out in the summary statistics and the density plots are the maximum values for both variables. The maximum value for HoursPerWeek corresponds to 24 hours per day of playing time every week, while the maximum reported value of 1,000,000 for TotalHours far exceeds the number of hours that had elapsed between SC2’s release and the end of the data collection period. When I discussed this matter with Professor Robbins, one of her suggestions was to “fence” the values such that outliers are replaced with a sensible ceiling value. Based on interviews with professional players as well as hearsay from those who keep in contact with them, I chose 98 hours played per week as a ceiling value. This corresponds to 14 hours played per day, a habit that could most likely only be sustained by professional players signed to a major team. Such teams often operate a “team house” where all the players live and train together, and it is not uncommon for sponsors to hire maids to cook and clean so that the players can focus all their attention on playing SC2. In such cases, it is not unusual for players to devote 10-14 hours per day to practicing for upcoming important matches.

Another suggestion was to simply remove these outliers from the dataset, since the respondents might not have given a serious answer to the question. When both methods are compared side by side below, there doesn’t seem to be any appreciable difference due to the small number of outliers (56 out of 3,339). As for the single data point with a value of 0 hours per week, I interpreted that as the respondent playing the game very infrequently. Because there was only one such data point and the fact that the next closest value is 2, I decided not to remove it.

#Restrict "HoursPerWeek" to a maximum value of 98

sc_restrict <- sc
sc_restrict$HoursPerWeek <- ifelse(sc_restrict$HoursPerWeek > 98, 98, sc_restrict$HoursPerWeek)

ggplot(sc_restrict, aes(sc_restrict$HoursPerWeek)) + geom_density(bw = 2.5, fill = 'purple', alpha = .5) + labs(x = 'Reported Hours Played Per Week', title = 'HoursPerWeek Outliers Fenced') + scale_x_continuous(breaks = c(seq(0, 100, by = 5))) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray'))

#Remove data with "HoursPerWeek" values greater than 98

sc_remove <- filter(sc, HoursPerWeek <= 98)

summary(sc_restrict$HoursPerWeek)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    8.00   12.00   15.87   20.00   98.00       1
summary(sc_remove$HoursPerWeek)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     8.0    12.0    15.8    20.0    98.0
ggplot(sc_remove, aes(sc_remove$HoursPerWeek)) + geom_density(bw = 2.5, fill = 'purple', alpha = .5) + labs(x = 'Reported Hours Played Per Week', title = 'HoursPerWeek Outliers Removed') + scale_x_continuous(breaks = c(seq(0, 100, by = 5))) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray'))

This variable looks to be unimodal after adjusting the bandwidth of the density plot, with a mode of approximately 7 and a mean of 15.87. The mean value corresponds to around 2 hours per day of play, which sounds reasonable for that time period, during which the game was still relatively new and the fanbase was at its largest. The distribution is right-skewed, which gives an idea of how many ‘typical’ players there are compared to the more dedicated ones.

Similarly, I tried both restricting the maximum value for TotalHours to 8,106 and removing data points that exceed that value. I arrived at this theoretical maximum by counting the number of days from the earliest date that a member of the general public could begin playing the game consistently (the pre-release beta test period began on February 17th, 2010) to the end of the data collection period, September 19th, 2011. Assuming that an extremely dedicated fan played the theoretical maximum of 14 hours per day every day during that period, he or she will have played the game for a total of 8,106 hours.
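This arithmetic can be reproduced directly in base R (a quick check, not part of the original analysis):

#Days from the start of the SC2 beta to the end of data collection
play_days <- as.numeric(as.Date('2011-09-19') - as.Date('2010-02-17'))
#Theoretical maximum assuming 14 hours of play every day
14 * play_days
## [1] 8106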

#Restrict "TotalHours" to a maximum value of 8,106

sc_restrict$TotalHours <- ifelse(sc_restrict$TotalHours > 8106, 8106, sc_restrict$TotalHours)
summary(sc_restrict$TotalHours)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     3.0   300.0   500.0   650.2   800.0  8106.0       2
ggplot(sc_restrict, aes(sc_restrict$TotalHours)) + geom_histogram(binwidth = 200, fill = 'purple', alpha = .5) + labs(x = 'Reported Total Number of Hours Played', title = 'TotalHours Outliers Restricted') + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray'))

#Remove data with "TotalHours" values greater than 8,106

sc_remove <- filter(sc_remove, TotalHours <= 8106)
summary(sc_remove$TotalHours)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0   300.0   500.0   633.7   800.0  6000.0
ggplot(sc_remove, aes(sc_remove$TotalHours)) + geom_histogram(binwidth = 200, fill = 'purple', alpha = .5) + labs(x = 'Reported Total Number of Hours Played', title = 'TotalHours Outliers Removed') + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray'))

Again, both methods for dealing with outliers yield almost identical distributions. In the above histograms, several outlier values that are well below 8,106 hours and yet well above the median of 500 can be seen. Sustaining a 14-hour-per-day SC2 habit alongside sleep and any school or work responsibilities would be quite difficult; even a professional player would have trouble keeping up with that level of practice for an extended period of time without feeling mental exhaustion or experiencing wrist injuries. Therefore, I decided to further restrict this variable to a maximum value of 4,000 hours, which corresponds to approximately 7 hours per day.

#Further restrict "TotalHours" to a maximum value of 4,000

sc_restrict$TotalHours <- ifelse(sc_restrict$TotalHours > 4000, 4000, sc_restrict$TotalHours)
summary(sc_restrict$TotalHours)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     3.0   300.0   500.0   639.2   800.0  4000.0       2
ggplot() + geom_histogram(data = sc_restrict, aes(x = sc_restrict$TotalHours), binwidth = 200, fill = 'purple', alpha = .5) + labs(x = 'Reported Total Number of Hours Played', title = 'TotalHours Outliers Restricted') + scale_x_continuous(breaks = c(seq(0, 4000, 250))) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray'))

ggplot() + geom_density(data = sc_restrict, aes(x = sc_restrict$TotalHours), bw = 75, fill = 'purple', alpha = .5) + labs(x = 'Reported Total Number of Hours Played', title = 'TotalHours Outliers Restricted') + scale_x_continuous(breaks = c(seq(0, 4000, 250)))  + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray'))

#Further remove data with "TotalHours" values greater than 4,000


sc_remove <- filter(sc_remove, TotalHours <= 4000)
summary(sc_remove$TotalHours)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0   300.0   500.0   625.2   800.0  4000.0
ggplot() + geom_histogram(data = sc_remove, aes(x = sc_remove$TotalHours), binwidth = 200, fill = 'purple', alpha = .5) + labs(x = 'Reported Total Number of Hours Played', title = 'TotalHours Outliers Removed') + scale_x_continuous(breaks = c(seq(0, 4000, 250))) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray'))

ggplot() + geom_density(data = sc_remove, aes(x = sc_remove$TotalHours), bw = 75, fill = 'purple', alpha = .5) + labs(x = 'Reported Total Number of Hours Played', title = 'TotalHours Outliers Removed') + scale_x_continuous(breaks = c(seq(0, 4000, 250)))  + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray'))

By removing the outlier values, the small bump in the tail at 4,000 hours is noticeably reduced. The distribution for TotalHours looks to have a larger variance compared to that of HoursPerWeek. Both distributions are similarly right-skewed, which is to be expected since these two variables are almost certainly correlated. Depending on the parameter settings for the density plot and histogram, there might be evidence of bimodality, with a distinct mode of approximately 450 and another possible one at approximately 700. The mean of 639.2 corresponds to approximately 1.5 hours of SC2 per day, which translates to around four average-length 1 vs. 1 matches.

Age

Although age is only an indirect indicator of the passage of time, I want to search for trends across different age groups. In the Starcraft community, it is commonly believed that one’s skill declines with age, such that past the early to mid 20’s, hand movement and reaction speeds begin to slow down. This belief is reinforced by the fact that an overwhelming majority of tournament champions are relatively young, often still in their teens. While the researchers plan to investigate whether or not such a physiological decline actually occurs in their next study, I feel that it would be sensible to see if there’s any supporting evidence for this belief using this dataset, even though it’s not as scientifically valid as analyzing a large quantity of replay files associated with individual players’ experiences with the game over a long period of time.

#Continuing the analysis by removing outlier values rather than restricting them

sc <- sc_remove

summary(sc$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   19.00   21.00   21.66   24.00   44.00
ggplot(sc, aes(x = League, y = Age)) + geom_boxplot(fill = 'purple', alpha = 0.3) + labs(title='Age vs. League') + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray'))

ggplot(sc, aes(sc$Age)) + geom_histogram(binwidth = 1, fill = 'purple', alpha = .5) + scale_x_continuous(breaks = c(seq(10,44, 2))) + labs(x = 'Age of the Participant') + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray'))

ggplot(sc, aes(sc$Age)) + geom_density(bw = .5, fill = 'purple', alpha = .5) + labs(x = 'Age of the Participant') + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + scale_x_continuous(breaks = c(seq(10,44, 2)))

A noticeable feature of this distribution is the abrupt cutoff at age 16. This is most likely due to ethical restrictions that the researchers adhered to with respect to allowing minors to participate in the study, rather than a lack of younger SC2 players. Were it not for this, the age distribution would look almost normal, with a mode of 21 and a mean of 21.66. The distribution proceeds fairly smoothly to the maximum age of 44; due to this smoothness, I decided not to treat participants in their early 40’s as outliers. The slope of the distribution decreases significantly past age 22, which lends some support to the community’s preconceived notions concerning age and skill level.

APM (Actions per Minute)

Players must be able to click on the right parts of the screen at the right time while reacting to what the opponent is doing.

Proceeding to the in-game metrics, the most obvious one to explore is APM, short for Actions Per Minute. An action is defined to be any key press or mouse click performed by the player. In SC2, APM is almost synonymous with skill level. It is typically understood that if a player can’t perform actions at a rate comparable to that of the opponent, then he or she simply can’t keep up and will eventually be overwhelmed, since the opponent can accomplish more in the same amount of time. The description accompanying this dataset does not specify whether the included APM values are average or maximum values achieved during each game. I assume that they represent average values, since average APM is one of the metrics that the game presents to the player at the conclusion of each game.

During battles, a skilled player should be able to position his army properly and micromanage several units simultaneously without neglecting the more mundane tasks of producing workers and constructing buildings back at base. Winning a battle while forgetting to manage one’s bases could leave one in a worse position and allow the opponent to catch up in the near future.

summary(sc$APM)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.06   79.12  107.00  114.30  139.90  389.80
ggplot(sc, aes(sc$APM)) + geom_density(bw = 3, fill = 'purple', alpha = .5) + labs(x = 'Actions Per Minute') + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + scale_x_continuous(breaks = c(seq(20,400, 20)))

The distribution for APM looks approximately normal, with a median of 107 and a mean of 114.3. The slope on the left side is much steeper, with a minimum value of approximately 22. This sounds reasonable, since that rate of actions can be achieved by nearly any able-bodied person using a computer without much effort. On the other end, we see some values exceeding 360, which translates to a sustained rate of 6 actions per second. Though there have been a few professional players who were known for consistently attaining such exceptionally high APM values, it could also be the case that the APM outliers in this dataset are due to some players inflating their APM by rapidly performing meaningless actions, such as clicking empty spaces. Players often do this at the beginning of a game to warm up their hand muscles, since there’s typically not much to do at such an early stage. However, some players may continue doing this throughout their games, possibly with the intention of inflating their APM rate. A few years after these replays were collected, an improved APM measurement that filters out such meaningless actions was added to the game; using that metric, we would be unlikely to see such extreme values.

ggplot(sc, aes(sc$League, sc$APM)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Actions per Minute') + scale_y_continuous(breaks = c(seq(0, 400, 25)))

Plotting the distribution of APM versus league rank supports the idea that these outlier APM values were artificially inflated. They originate from players in the Diamond and Master Leagues, which are far from the highest levels of competitive play, and they are far removed from the median values of their respective leagues. The fact that several values exceed even the highest APM for GrandMaster players suggests that they are not realistically attainable for players in these leagues. Therefore, I decided to remove data points where APM values exceed 275, which is the approximate median value for the professional players represented in this dataset.

#Remove data points belonging to Diamond and Master League that have over 275 APM

sc <- filter(sc, !(APM > 275 & League %in% c('Diamond', 'Master')))

ggplot(sc, aes(sc$APM)) + geom_density(bw = 3, fill = 'purple', alpha = .5) + labs(x = 'Actions Per Minute') + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + scale_x_continuous(breaks = c(seq(20,400, 20)))

ggplot(sc, aes(sc$League, sc$APM)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 1.5, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Actions Per Minute') + scale_y_continuous(breaks = c(seq(0, 400, 25)))

Having removed these outlier values, the shape of the density plot did not change much, but the boxplot looks more reasonable. There is a positive correlation between League and APM that supports the idea that APM is a reasonable indicator of skill level. However, it should be noted that player actions can rarely be accurately categorized as either “meaningful” or “meaningless”. There are often several events happening simultaneously across the battlefield that demand a player’s attention, and SC2 expertise heavily depends on the ability to prioritize these events and focus attention on the most important ones. To give two extreme examples, a player who devotes full attention to only controlling a single unit at a time will most likely not win many games, much like a player who instantly shifts his or her attention to the latest event without any prioritization. Even with “meaningful” actions, the decisions that a player makes can be judged on a continuous scale of effectiveness that depends on the context of what is happening and what the player knows at that moment. In broadcast games, commentators are expected to analyze these decisions and point out which ones they think are wise or unwise. Thus, one cannot declare that one player is definitely better than another based solely on a comparison of APM.

4. Executive Summary

My main goals with this project were to either support or refute the hierarchy of skills that is often touted within the SC2 community, and to observe the effects that time, in the form of player age and hours spent playing SC2, had on these skills. From my analysis of the dataset, I concluded that the hierarchy and its ordering of skills is mostly accurate. I also discovered that time appears to have a noticeable effect on only a specific subset of the skills.

The hierarchy of skills can be thought of as a road map for how a player can gradually improve in skill and progress through the various leagues of the online ladder system. Starting from the lowest level, Bronze League, a player should focus on learning each listed skill in the order of the hierarchy. Mastering the initial skills will allow a player to quickly rise through the lower ranks, whereas mastery of the latter skills differentiates the mid-tier players from the high-tier ones. More specifically, players should prioritize learning skills that will allow them to leave Bronze League and enter the mid-tier leagues. Once they have the necessary foundational skills, they can devote effort to learning the skills that will allow them to advance to the upper leagues.

I attempted to validate this skill ordering by looking at boxplots of the various in-game metrics and comparing the changes in median and mean values when transitioning from lower leagues to higher leagues. For each metric, I drew a line of best fit to observe the rate of change in that metric. Primarily, I noted the slope of the line as it transitioned between different pairs of adjacent leagues and looked for the transitions that showed steep slopes. Steepness indicates a noticeable difference in mastery of that skill between players of those two leagues. I used the following criteria to validate the relative importance of each skill in being promoted from Bronze League and moving up the ladder:

  1. I deemed skills that exihibited a steep slope when transitioning through the lower leagues to be the most crucial towards improving enough to advance past the lower leagues.
  2. Next, I looked at how steep the slopes of each line were across all leagues. Lines that maintained steepness were deemed to be important for players of all skill levels.
  3. Finally, lines that were relatively flat across all leagues were deemed to be equally important at every skill level.
The ordering of the metrics was validated by observing the slopes of the lines of best fit across different leagues.
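As a concrete sketch of this method, the per-league medians of a metric can be reduced to adjacent-league differences, whose magnitudes play the role of the slopes described above. The medians used here are the WorkersMade values reported later in this report; the code itself is illustrative and not part of the original analysis.

```r
# Per-league medians of a metric (here: WorkersMade, in workers per minute),
# taken from the per-league summaries reported later in this report.
leagues <- c("Bronze", "Silver", "Gold", "Platinum", "Diamond", "Master", "GrandMaster")
medians <- c(2.934, 3.840, 4.292, 4.689, 5.391, 5.628, 5.333)

# The "slope" between adjacent leagues is the difference in medians;
# large positive values mark transitions where mastery of the skill jumps.
slopes <- diff(medians)
names(slopes) <- paste(head(leagues, -1), tail(leagues, -1), sep = " -> ")
round(slopes, 3)
# The Bronze -> Silver difference (0.906) is the largest of the six
```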

With this method, I came to the conclusion that the skill hierarchy proposed by the community is supported by this dataset and should be followed by novice SC2 players who hope to improve.

This dataset contains three variables that measure time in different ways. One variable is the age of the player. The other two are the total hours the player has played the game and the number of hours played per week. I primarily wanted to see if there was any evidence supporting the belief that player skill deteriorates with age. Additionally, I wanted to see which player skills improve the most with additional experience playing the game. To achieve this, I used a correlation plot to show the relationships between these variables and the skill variables.
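As a minimal sketch of this approach, a correlation plot can be produced as follows. The toy data frame stands in for the real dataset, and the column names here are my own illustrative placeholders, as is the use of the corrplot package.

```r
# Sketch of the correlation-plot approach using a toy data frame in place of
# the real dataset (column names and values are illustrative only).
library(corrplot)

set.seed(1)
toy <- data.frame(
  Age          = rnorm(100, mean = 25, sd = 5),
  TotalHours   = rnorm(100, mean = 500, sd = 200),
  HoursPerWeek = rnorm(100, mean = 15, sd = 6),
  APM          = rnorm(100, mean = 100, sd = 30)
)

# Pairwise Pearson correlations, drawn as a colored matrix of circles
corrplot(cor(toy, use = "complete.obs"), method = "circle")
```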

5. Main Analysis

It is commonly thought that higher APM comes naturally through becoming more skilled at SC2. In the SC2 community, it is believed that there is a rough hierarchy of skills that one should acquire in order to improve. The variables corresponding to these skills, ranked in approximate descending order of importance, are the following:

  1. WorkersMade
  2. MinimapAttacks / MinimapRightClicks
  3. AssignToHotkeys / SelectByHotkeys
  4. TotalMapExplored
  5. UniqueUnitsMade
  6. ComplexAbilitiesUsed / ComplexUnitsMade
  7. UniqueHotkeys

It should be noted that these variables are rates expressed in “timestamps”, which is how the game internally measures time. There are approximately 88.5 timestamps per second according to the dataset documentation. Since such a short unit of time produces very small values, I will express these variables as per-minute rates to make the data and graphs easier to understand. Throughout this section, I encountered outliers for several variables that made it difficult to read the graphs. I dealt with these outlier values by removing them from the dataset. I made this decision because these data points comprise an extremely small portion of the dataset. Also, since most of these metrics are fairly obscure and not discussed often in the community, it would be difficult to decide on a “reasonable” ceiling value at which to cap these variables.
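The conversion applied throughout the code in this section is a simple scaling: a per-timestamp rate times 88.5 timestamps per second times 60 seconds per minute gives a per-minute rate. A minimal sketch (the helper name is mine, not part of the original analysis):

```r
# Convert a per-timestamp rate to a per-minute rate, using the
# documented figure of ~88.5 timestamps per second.
to_per_minute <- function(rate_per_timestamp) {
  rate_per_timestamp * 88.5 * 60
}

to_per_minute(0.001)  # 0.001 actions/timestamp is 5.31 actions/minute
```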

In addition, I decided to omit the MaxTimeStamp variable from my analysis. This variable indicates the length of each game. While there may be some correlation between how skilled a player is and his or her average game length, each player’s choice of strategy has a major effect on how long a particular game will be. For example, before a game begins, a player can decide to invest significant resources early on into a quick surprise attack to catch the opponent off guard. If such an attack fails, then the attacking player will be in a vulnerable position and may be defeated by the opponent’s counterattack. In this case, the game is likely to end early regardless of the outcome. On the other hand, both players can just as easily decide to focus on their own army and building production while avoiding conflicts early on. Such a game would take much longer to finish. Since game length is largely affected by conscious choices that players make in addition to behavior derived from skill level, it would be difficult to isolate these two factors in order to assess MaxTimeStamp as an indicator of skill level.

1) WorkersMade

The typical answer to the question of how one gets better at SC2 is to make more workers. This might seem surprising at first. Worker units are primarily responsible for collecting resources and constructing buildings. This might sound quite mundane, but collecting enough resources is of utmost importance, since resources are the currency with which players build their army and bases. Collecting resources at a suboptimal rate due to having too few workers will create a bottleneck that slows the rate at which your army can be produced. Additionally, workers can serve as scouts or sacrificial pawns during attacks and defenses. People often tell novices that they can get promoted out of Bronze League with this simple strategy: constantly produce workers, collect lots of resources, build an army with those resources (the type of units built doesn’t really matter), and send that army straight to the enemy base.

These humble workers are busy collecting resources inside this player’s main base.

This graph showing the number of workers each player possesses at any given time is just one of several pages of detailed statistics that are presented to players at the conclusion of each game.

The violin plot below shows that there is a slightly positive correlation between how many workers are produced per minute and the rank of the player. By looking at the median lines, one can see a slight downward trend going from Master to GrandMaster. A more pronounced decrease can be seen when looking at the outlier values from Diamond to GrandMaster. This suggests that one should find the proper balance between making too few workers (collecting resources at a suboptimal rate) and making too many (wasting money by producing unneeded workers who have nothing to do). On the other end, the transition from Bronze to Silver raises the median from 2.934 to 3.84, a larger jump than between any other pair of adjacent leagues. This supports the claim that players can be promoted out of Bronze League simply by focusing on improving their worker production habits. Meanwhile, worker production rates level out at the higher leagues. The outlier values in Diamond League could be a sign of players producing more workers than needed, which would be an inefficient use of resources.

ggplot(sc, aes(sc$League, sc$WorkersMade*88.5*60)) + geom_violin(fill = 'purple', alpha = 0.5, draw_quantiles = c(0.5)) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 1.5, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Workers Produced per Minute') + scale_y_continuous(breaks = c(seq(0,30, 2)))

#Display summary statistics for "WorkersMade" for each league

for(league in rank) {
  print(league)
  print(summary(filter(sc, sc$League == league)$WorkersMade*88.5*60))
}
## [1] "Bronze"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4089  2.1570  2.9340  3.3260  4.0100 10.9400 
## [1] "Silver"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8776  2.8510  3.8400  4.2540  5.0620 14.7200 
## [1] "Gold"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.508   3.323   4.292   4.871   6.006  19.840 
## [1] "Platinum"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.284   3.616   4.689   5.344   6.406  18.850 
## [1] "Diamond"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.482   4.207   5.391   6.188   7.469  27.340 
## [1] "Master"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.496   4.372   5.628   6.427   7.673  21.880 
## [1] "GrandMaster"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.570   3.777   5.333   6.413   7.267  16.940

2) MinimapAttacks / MinimapRightClicks

The second most important metric might be a surprising choice as well. The minimap is a map of the battlefield located in the bottom left corner of the user interface (as shown in the first gameplay screenshot in the introduction section). At a glance, players can see what is happening wherever they have an army presence. By clicking on the minimap, players can instantly switch views to the corresponding location. More importantly, they can order units to move to or attack that location. The alternative is for the player to manually scroll to that location with the mouse, which could take much longer depending on how big the battlefield is and where the player was looking prior to scrolling. With the minimap, players can observe and react to situations occurring in the game faster and more efficiently. The difference is comparable to editing a long document with the ability to jump to different sections by clicking or scrolling with a mouse versus having to manually scroll through the text using the arrow keys on the keyboard.

This is the interface through which players interact with their bases and armies in Starcraft II. The minimap is positioned at the bottom left corner.

ggplot(sc, aes(sc$League, sc$MinimapAttacks*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) +  theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Minimap Attack Commands Issued per Minute')

ggplot(sc, aes(sc$League, sc$MinimapRightClicks*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Minimap Right Clicks Performed per Minute')

There are a few outlier data points that are obscuring the view of the league distributions for MinimapAttacks (17/3395) and MinimapRightClicks (7/3395). After removing these points, the distributions can be more clearly seen:
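Counts like these can be obtained by comparing each variable against the cutoff used in the filtering steps that follow; a sketch, assuming `sc` is the loaded dataset:

```r
# Count the data points above each cutoff (the same thresholds used in the
# filter() calls below); per the text, 17 and 7 of 3395 respectively.
sum(sc$MinimapAttacks > 0.001003)
sum(sc$MinimapRightClicks > 0.0028)
```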

#Remove the "MinimapAttacks" outlier values and convert the values in terms of minutes

sc <- filter(sc, sc$MinimapAttacks <= 0.001003)

ggplot(sc, aes(sc$League, sc$MinimapAttacks*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 1.5, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Minimap Attack Commands Issued per Minute') + scale_y_continuous(breaks = c(seq(0,7, 0.25)))

for(league in rank) {
  print(league)
  print(summary(filter(sc, sc$League == league)$MinimapAttacks*88.5*60))
}
## [1] "Bronze"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1502  0.1519  1.3840 
## [1] "Silver"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.07434 0.23670 0.24000 4.66900 
## [1] "Gold"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1131  0.2838  0.3807  2.9940 
## [1] "Platinum"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1845  0.3974  0.5104  4.7190 
## [1] "Diamond"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.07301 0.30320 0.55590 0.69710 4.78700 
## [1] "Master"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1670  0.4848  0.7678  1.0520  5.3250 
## [1] "GrandMaster"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05161 0.71670 1.27500 1.55500 1.95200 4.73800
#Remove the "MinimapRightClicks" outlier values and convert the values in terms of minutes

sc <- filter(sc, sc$MinimapRightClicks <= 0.0028)

ggplot(sc, aes(sc$League, sc$MinimapRightClicks*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 1.5, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Minimap Right Clicks Performed per Minute') + scale_y_continuous(breaks = c(seq(0, 15, 1)))

for(league in rank) {
  print(league)
  print(summary(filter(sc, sc$League == league)$MinimapRightClicks*88.5*60))
}
## [1] "Bronze"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3454  0.8103  1.1050  1.6140  7.4630 
## [1] "Silver"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.4721  1.0410  1.4630  2.0310 10.7900 
## [1] "Gold"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.666   1.271   1.745   2.342  12.070 
## [1] "Platinum"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.7308  1.4410  1.9160  2.5050 10.3900 
## [1] "Diamond"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.8876  1.6690  2.2530  3.1530 11.4100 
## [1] "Master"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.005   1.810   2.459   3.384  13.270 
## [1] "GrandMaster"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1333  1.6350  2.7240  2.9810  3.8270  8.6670

In both cases, there’s a positive correlation that grows stronger in the higher leagues. For MinimapAttacks, the slope of the LOESS line increases at the upper leagues. The large increase in the median between Master and GrandMaster (0.4848 to 1.275) shows the importance of using the minimap in attaining the highest level of skill. On the other end, the median also changes significantly from Bronze to Silver (0 to 0.07434), which supports the belief that at least some minimap usage can be enough to be promoted out of Bronze League. It seems that many players from Bronze all the way to Platinum League don’t issue attack commands through the minimap, or do so very rarely. For MinimapRightClicks, the variation between leagues isn’t as pronounced, with a roughly linear LOESS line. In the lower leagues, the median value of 0.8103 for Bronze League indicates that this is an action that most novice players already know to do; therefore, they just need to learn to do it more often.

3) AssignToHotkeys and SelectByHotkeys

It could be argued that learning to use hotkeys is equally essential to improving at SC2. Hotkeys are the computer game equivalent of keyboard shortcuts, and without them, one would be reduced to playing the game with only a mouse. To give an idea of how inefficient and time-consuming that would be, imagine having to type by using a mouse to click each letter on a virtual keyboard at the bottom of the computer screen. This analogy is not far from the truth, since most of the buttons corresponding to actions in the game are located at the bottom of the user interface.

Each possible command in the game has an associated default hotkey. In addition, players can assign specific groups of units and buildings to one of the number keys. Therefore, by pressing a number key, a player can immediately select several units or buildings simultaneously, no matter where they are on the map or how far apart they are, and issue commands to them. The variable AssignToHotkeys indicates how often a player makes these assignments, whereas SelectByHotkeys indicates how often a player controls these previously assigned groups using the hotkeys.

ggplot(sc, aes(sc$League, sc$AssignToHotkeys*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Hotkey Assignments Made per Minute')

#Remove outlier values for the "AssignToHotkeys" variable

sc <- filter(sc, sc$AssignToHotkeys*88.5*60 <= 6.5)

ggplot(sc, aes(sc$League, sc$AssignToHotkeys*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 1.5, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Hotkey Assignments Made per Minute') + scale_y_continuous(breaks = c(seq(0,7, 0.5)))

for(league in rank) {
  print(league)
  print(summary(filter(sc, sc$League == league)$AssignToHotkeys*88.5*60))
}
## [1] "Bronze"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.4811  0.8097  0.9849  1.3340  3.5450 
## [1] "Silver"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.5614  0.9851  1.1790  1.7030  3.6620 
## [1] "Gold"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.8148  1.3810  1.4990  2.0780  4.8760 
## [1] "Platinum"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1237  1.0910  1.7520  1.7860  2.3820  5.3920 
## [1] "Diamond"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1651  1.5140  2.1630  2.1910  2.7800  6.2370 
## [1] "Master"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3452  1.9320  2.6080  2.6890  3.3150  6.3600 
## [1] "GrandMaster"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.645   2.493   3.529   3.592   4.553   6.479
ggplot(sc, aes(sc$League, sc$SelectByHotkeys*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Hotkey Groups Selected per Minute')

After removing those outliers, there are still some data points with outlier values for the SelectByHotkeys variable (32/3395) that should be removed as well.

#Remove outlier values for the "SelectByHotkeys" variable and convert the values in terms of minutes

sc <- filter(sc, sc$SelectByHotkeys*88.5*60 < 150)

ggplot(sc, aes(sc$League, sc$SelectByHotkeys*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Hotkey Groups Selected per Minute') + scale_y_continuous(breaks = c(seq(0,200, 10)))

for(league in rank) {
  print(league)
  print(summary(filter(sc, sc$League == league)$SelectByHotkeys*88.5*60))
}
## [1] "Bronze"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.558   3.566   5.741   7.274  74.270 
## [1] "Silver"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.083   5.813   8.154   9.809  74.290 
## [1] "Gold"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   4.692   8.201  11.600  13.500 104.000 
## [1] "Platinum"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   6.661  11.490  16.500  19.900 124.400 
## [1] "Diamond"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.2825  11.0700  18.2200  24.9900  31.9400 147.0000 
## [1] "Master"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.6472  15.9500  28.2200  36.1500  48.3100 148.7000 
## [1] "GrandMaster"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.30   30.07   43.55   50.58   65.27  142.20

In Bronze and Silver League, the median and mean values for AssignToHotkeys are close to 1. From my experience with the game, this reflects the habit of assigning one’s entire army to one hotkey and updating the assignment by adding newly produced units to that hotkey group. Players who control their entire army as a single large mass mostly do so because they lack the multitasking skills to effectively control multiple groups at a time. However, this setup makes it difficult for them to split their units up to deal with threats occurring simultaneously in different places. The median and mean values for SelectByHotkeys in the lower leagues are similarly low because there’s little need to make hotkey selections when your entire army is assigned to only one or two hotkey groups. This type of habit seems to disappear as one progresses through the leagues. Comparing the lower leagues to the higher ones indicates that these variables matter more for being promoted from the middle leagues. The differences between the Diamond, Master, and GrandMaster League distributions are quite pronounced. This illustrates the importance of multitasking in SC2, and specifically, the importance of maintaining many hotkey groups and continuously selecting them to issue orders to attack and defend.

4) TotalMapExplored

The need for intelligence gathering during battles should be apparent. As a consequence of games happening under the fog of war, players have an incentive to constantly scout and observe the opposing player’s actions while hiding their own actions from enemy scouts. As previously mentioned, players can decide to perform sneak attacks early in a game to win the game quickly and decisively. Even in slower paced games, one should be aware of what kind of army the opponent is producing and where that army is moving. In SC2, each army unit has a “rock, paper, scissors” relationship with other types of units, meaning that each unit is effective at fighting certain types of units while being ineffective against others. Thus, by keeping watch on the opponent’s army, one can respond by producing the types of units that are effective against the opposing army.

In SC2, the battlefield is divided into squares that are about the size of the mouse pointer. The TotalMapExplored metric measures how many 24x24 grids of these squares a player has explored per timestamp. During the course of a game, players will naturally explore territory by virtue of expanding to new bases to collect more resources and sending their army to attack the opponent. However, a skilled player will go beyond this and intentionally send units to different parts of the map to confirm what the opponent is doing and to make sure that no sneak attacks are incoming. Timely information received from scouting can mean the difference between victory and defeat in many games.

ggplot(sc, aes(sc$League, sc$TotalMapExplored*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Number of 24x24 Map Grids Explored per Minute')+ scale_y_continuous(breaks = c(seq(0,5, 0.25)))

The league distribution medians seem to be increasing almost linearly, but I’d like to remove the few outlier values (2/3395) greater than 4.00 to make sure there isn’t a polynomial increase being obscured.

#Remove outlier values for the "TotalMapExplored" variable and convert the values in terms of minutes

sc <- filter(sc, sc$TotalMapExplored < 0.00074)

ggplot(sc, aes(sc$League, sc$TotalMapExplored*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Number of 24x24 Map Grids Explored per Minute') + scale_y_continuous(breaks = c(seq(0,5, 0.25)))

for(league in rank) {
  print(league)
  print(summary(filter(sc, sc$League == league)$TotalMapExplored*88.5*60))
}
## [1] "Bronze"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5726  0.9990  1.2220  1.3300  1.5570  3.6790 
## [1] "Silver"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5853  1.0320  1.3040  1.3620  1.5640  3.6100 
## [1] "Gold"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4848  1.0880  1.3060  1.3750  1.5780  3.1050 
## [1] "Platinum"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5469  1.1930  1.4040  1.4680  1.6680  3.5180 
## [1] "Diamond"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.651   1.270   1.515   1.570   1.799   3.903 
## [1] "Master"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.621   1.303   1.548   1.636   1.866   3.696 
## [1] "GrandMaster"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9095  1.2530  1.6510  1.6190  1.8620  2.7500

There’s a relatively large increase in the median going from Bronze to Silver League, which indicates that Silver League players leave their bases more often, though it’s unclear whether this is the result of purposeful scouting or just a consequence of being more active on the battlefield when performing other actions. Past Bronze League, the distributions look relatively flat until the middle leagues, according to the LOESS line. Knowing to scout seems to be a skill that’s acquired when transitioning from Gold to Platinum League. The transition from Master to GrandMaster League shows another relatively large median increase, which indicates that scouting is a crucial skill to acquire in order to reach the highest skill level.

5) UniqueUnitsMade

At first, I thought this variable measured the rate at which players were producing units. This would be a measure of a player’s macromanagement and his or her ability to spend resources efficiently. However, upon converting the values to reflect minutes rather than timestamps, I saw that the values were much too low, considering that games typically last 15-25 minutes and armies often consist of dozens of units.

ggplot(sc, aes(sc$League, sc$UniqueUnitsMade*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Unique Units Made per Minute') + scale_y_continuous(breaks = c(seq(0, 1.1, 0.05)))

These values would imply that, on average, only a handful of units are produced per game, which is clearly not correct. I then tried multiplying this variable by MaxTimeStamp, the number of timestamps each game contained (i.e., the duration of each game). This should yield the total number of unique units produced per game.

ggplot(sc, aes(sc$League, sc$UniqueUnitsMade*sc$MaxTimeStamp)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Unique Units Made per Game') + scale_y_continuous(breaks = c(seq(0,15, 1)))

Notice that all of the values fall between 2 and 13. Given this range, and considering that each player can choose from at most roughly 15 types of units, this variable is most likely the number of different types of units produced per game divided by the total length of the game. A value of 2 would indicate that the player produced nothing but workers and the most basic, low-tech attacking unit. Given my new understanding of this variable, I decided not to use it any further during this project. While making a diverse army comprised of various types of units can be seen as a measure of how well a player adapts to changing situations in the game by adjusting the composition of his or her army, it is also true that UniqueUnitsMade is naturally correlated with game length. This is because the more advanced, high-tech units cannot be produced early in the game due to their high costs and prerequisite infrastructure requirements (this idea is explained in more detail in the next section). As mentioned before, game length is in part determined by conscious decisions made by each player, so it would not be appropriate to interpret this metric as a pure measure of skill.

6) ComplexAbilitiesUsed / ComplexUnitsMade

In SC2, there is a hierarchy to which all the different types of buildings and army units belong, called the “tech tree”. In the tech tree, buildings and units further up the tree can only be accessed once certain buildings further down the tree have been constructed. This network of dependencies ensures that the inexpensive, low-tech buildings and units are accessible relatively early in the game, whereas the most powerful and expensive units are reserved for later in the game. Often, these units have devastating abilities that can be used repeatedly (with a predetermined recharge period between uses) to cripple an opposing army if used correctly. Doing so depends on the player manually clicking on the intended target in an accurate and timely manner. Otherwise, the attack misses its target and is wasted.

Several units have activated abilities that must be triggered at the right time or targeted abilities that require the player to directly click on the area or unit to be targeted.

ComplexAbilitiesUsed measures how often a player uses such targeted abilities, whereas ComplexUnitsMade measures the rate at which these advanced units are produced. Properly utilizing these units and abilities requires another layer of micromanagement on top of everything else that a player needs to attend to during a game. Therefore, one would think that higher skilled players will use them with greater frequency.

ggplot(sc, aes(sc$League, sc$ComplexAbilityUsed*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Targeted Abilities Used per Minute')

ggplot(sc, aes(sc$League, sc$ComplexUnitsMade*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Advanced Units Produced per Minute')

Again, the lower league distributions seem like they might be obscured due to outliers (99/3395 for ComplexAbilityUsed and 3/3395 for ComplexUnitsMade).

#Remove outlier values for "ComplexAbilityUsed" and convert the values in terms of minutes

sc <- filter(sc, sc$ComplexAbilityUsed < 0.00188)

ggplot(sc, aes(sc$League, sc$ComplexAbilityUsed*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Targeted Abilities Used per Minute') + scale_y_continuous(breaks = c(seq(0,10, 0.5)))

for(league in rank) {
  print(league)
  print(summary(filter(sc, sc$League == league)$ComplexAbilityUsed*88.5*60))
}
## [1] "Bronze"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2217  0.2392  3.3450 
## [1] "Silver"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4015  0.3045  9.3640 
## [1] "Gold"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5606  0.6701  8.3970 
## [1] "Platinum"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1625  0.7236  0.9349  9.7880 
## [1] "Diamond"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.3441  0.9119  1.2080  9.5040 
## [1] "Master"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2591  0.9162  1.3490  9.1170 
## [1] "GrandMaster"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5945  0.9948  4.3870
#Remove outlier values for "ComplexUnitsMade" and convert the values in terms of minutes

sc <- filter(sc, sc$ComplexUnitsMade < 0.00075)

ggplot(sc, aes(sc$League, sc$ComplexUnitsMade*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Advanced Units Produced per Minute') + scale_y_continuous(breaks = c(seq(0,4, 0.25)))

for(league in rank) {
  print(league)
  print(summary(filter(sc, sc$League == league)$ComplexUnitsMade*88.5*60))
}
## [1] "Bronze"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07784 0.00000 1.68800 
## [1] "Silver"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1293  0.0000  2.6230 
## [1] "Gold"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2338  0.1362  3.1330 
## [1] "Platinum"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3398  0.5252  3.4910 
## [1] "Diamond"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3976  0.6984  3.5930 
## [1] "Master"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4002  0.6635  3.2030 
## [1] "GrandMaster"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3373  0.4641  2.0490

It appears that the lower-league distributions for these two variables truly do look like that: players in Bronze and Silver League use these advanced units and their abilities extremely rarely. In fact, the median value of ComplexUnitsMade is 0 in every league. Looking at the means, there is a positive trend from Bronze through Master League, followed by a slight decrease in GrandMaster League. These therefore seem to be additional examples of skills that are rarely used in the lower leagues but are used much more often among more skilled players.
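A quick note on the repeated `*88.5*60` factor used in these conversions: the dataset's rate variables are recorded per timestamp, and the interpretation used throughout this analysis is that roughly 88.5 timestamps correspond to one real-time second. A minimal helper makes the conversion explicit:

```r
# Convert a per-timestamp rate to a per-minute rate, assuming roughly
# 88.5 timestamps per real-time second (the interpretation used
# throughout this analysis).
per_minute <- function(rate_per_timestamp) {
  rate_per_timestamp * 88.5 * 60
}

per_minute(0.00075)  # the ComplexUnitsMade outlier cutoff, ~3.98 per minute
per_minute(0.00188)  # the ComplexAbilityUsed outlier cutoff, ~9.98 per minute
```

This also explains why the outlier cutoffs above look sensible once converted: they sit just above the maxima of the per-minute distributions.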

7) UniqueHotkeys

Originally, I had ranked this metric in second place on the skill hierarchy. From the variable name and the researchers’ description, I thought that it measured how often players used different hotkeys to input commands. However, when I converted the values to per-minute rates, they were much too low to make sense in the context of APM and the other hotkey variables. For example, how could a GrandMaster player be using less than one unique hotkey per minute while most likely performing over 150 actions per minute? After consulting with my friend, we concluded that this metric actually measures the rate at which the player uses hotkeys that he or she has changed from the default settings. For example, players can remap commonly used commands to neighboring keys on the keyboard to make issuing commands easier. Conversely, a player can reassign a little-used command so as to avoid pressing it accidentally during hectic periods in the game. There is a steep learning curve in moving from not using hotkeys at all, to beginning to use them, to committing them to memory so that they can be used quickly and comfortably, to finally understanding one’s needs and customizing one’s hotkey setup to match them. Given that, I would expect little difference among the lower-league distributions for this metric, but a pronounced difference among the upper-league distributions.

ggplot(sc, aes(sc$League, sc$UniqueHotkeys*88.5*60)) + geom_boxplot(outlier.color = 'blue', fill = 'purple', alpha = 0.5) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'League', y = 'Unique Hotkeys Used per Minute') + scale_y_continuous(breaks = c(seq(0, 2, 0.1)))

#Display summary statistics for "UniqueHotkeys" for each league

for(league in rank) {
  print(league)
  print(summary(filter(sc, sc$League == league)$UniqueHotkeys*88.5*60))
}
## [1] "Bronze"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1065  0.2267  0.2404  0.3398  1.2800 
## [1] "Silver"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1099  0.2124  0.2386  0.3348  0.7454 
## [1] "Gold"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1306  0.2283  0.2629  0.3637  1.7920 
## [1] "Platinum"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1561  0.2551  0.2775  0.3729  1.2480 
## [1] "Diamond"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2023  0.2992  0.3368  0.4402  1.5580 
## [1] "Master"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2627  0.3725  0.3939  0.4910  1.4970 
## [1] "GrandMaster"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2113  0.2942  0.4269  0.4350  0.5334  0.9168

The data seem to support my intuition. In the lower leagues, the LOESS line increases quite slowly, but its slope increases sharply in the transition from Platinum to Diamond League. It seems that this variable is not so important for less skilled players, but it is a distinguishing factor among players in the upper leagues.

Analysis of Each Metric with Respect to Time Variables

Having analyzed each of the relevant variables individually, I’d like to have a quick overview of the correlations between the various in-game metrics and each time variable.

library(corrplot)
library(viridis)
## Loading required package: viridisLite
#Visualize correlation coefficients between "Age" and all other variables

compare_age <- data.frame(c(sc[3], sc[6:11], sc[16:17], sc[19:20]))
corr_age <- cor(compare_age)
corrplot.mixed(corr_age, lower = 'number', upper = 'ellipse', col = viridis(256), title = 'Correlations with Player Age', mar=c(0,0,1,0))

#Visualize correlation coefficients between "HoursPerWeek" and all other variables

compare_week <- data.frame(c(sc[4], sc[6:11], sc[16:17], sc[19:20]))
corr_week <- cor(compare_week)
corrplot.mixed(corr_week, lower = 'number', upper = 'ellipse', col = viridis(256), title = 'Correlations with Hours Played per Week', mar=c(0,0,1,0))

#Visualize correlation coefficients between "TotalHours" and all other variables

compare_total <- data.frame(c(sc[5], sc[6:11], sc[16:17], sc[19:20]))
corr_total <- cor(compare_total)
corrplot.mixed(corr_total, lower = 'number', upper = 'ellipse', col = viridis(256), title = 'Correlations with Total Hours Played', mar=c(0,0,1,0))

At first glance, there appear to be some correlations between the metrics and the three time variables. More importantly, all but one of the metrics are negatively correlated with Age and positively correlated with HoursPerWeek and TotalHours. Interestingly, for all three time variables, the three variables with the greatest-magnitude correlation coefficients are APM, SelectByHotkeys, and AssignToHotkeys.
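The "greatest magnitude" claim can be checked programmatically by sorting the absolute coefficients from a correlation matrix. Here is a minimal sketch with a toy matrix; on the real data, `corr_age`, `corr_week`, or `corr_total` from the chunks above would be used in place of `toy`:

```r
# Rank variables by the absolute value of their correlation with the first
# column (e.g., Age in corr_age). Toy data shown for illustration only.
toy <- cor(data.frame(Age = c(18, 20, 25, 30),
                      APM = c(180, 150, 110, 90),
                      UniqueHotkeys = c(2, 2, 3, 3)))
sort(abs(toy['Age', colnames(toy) != 'Age']), decreasing = TRUE)
```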

ggplot(sc, aes(Age, APM)) + geom_point(color = 'purple', alpha = 0.25) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'Player Age', y = 'Actions Per Minute') + geom_smooth(method = 'lm', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + scale_y_continuous(breaks = c(seq(0, 400, 25)))

ggplot(sc, aes(Age, SelectByHotkeys*88.5*60)) + geom_point(color = 'purple', alpha = 0.25) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'Player Age', y = 'Hotkey Groups Selected per Minute')  + geom_smooth(method = 'lm', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + scale_y_continuous(breaks = c(seq(0,200, 25)))

ggplot(sc, aes(Age, AssignToHotkeys*88.5*60)) + geom_point(color = 'purple', alpha = 0.25) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'Player Age', y = 'Hotkey Assignments Made per Minute') + geom_smooth(method = 'lm', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + scale_y_continuous(breaks = c(seq(0,7, 0.5)))

It should be noted that there is an obvious correlation between the two hotkey variables and APM: proper hotkey usage contributes greatly to a high APM by allowing faster and more efficient actions, which saves time that the player can spend performing more actions. The same holds for the other variables shown in the correlation matrix. Therefore, it is natural that these variables would be correlated with the time variables in the same direction.

One can also see that the spread within any given age group decreases with age, at least in part because the sample contains far fewer older players than players in their teens and early twenties. With that caveat, there is some evidence that age negatively affects these metrics.
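The sample-size explanation can be checked directly by tabulating the count and spread of a metric per age group. A minimal sketch in base R, with toy data standing in for `sc`:

```r
# Toy data standing in for sc; on the real data, use sc$Age and sc$APM.
toy <- data.frame(Age = c(18, 18, 18, 18, 35, 35),
                  APM = c(90, 150, 200, 120, 80, 95))

# Count and standard deviation of APM per age group
aggregate(APM ~ Age, data = toy,
          FUN = function(x) c(n = length(x), sd = sd(x)))
```

On the real data, one would expect both the counts and the standard deviations to shrink at higher ages.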

The only variable with a positive correlation with Age is UniqueHotkeys.

ggplot(sc, aes(Age, UniqueHotkeys*88.5*60)) + geom_point(color = 'purple', alpha = 0.25) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'Player Age', y = 'Unique Hotkeys Used per Minute') + geom_smooth(method = 'lm', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + scale_y_continuous(breaks = c(seq(0, 2, 0.1)))

The slope of the regression line looks to be almost 0 unless one inspects it closely. Therefore, it seems that age does not have much of an effect on how players customize their hotkey setups. This could be in part due to our previous finding that UniqueHotkeys seems to be the least important metric in the skill hierarchy.

ggplot(sc, aes(HoursPerWeek, APM)) + geom_point(color = 'purple', alpha = 0.25) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'Reported Hours Played Per Week', y = 'Actions Per Minute') + geom_smooth(method = 'lm', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + scale_y_continuous(breaks = c(seq(0, 400, 25))) + scale_x_continuous(breaks = c(seq(0, 100, by = 5)))

ggplot(sc, aes(HoursPerWeek, SelectByHotkeys*88.5*60)) + geom_point(color = 'purple', alpha = 0.25) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'Reported Hours Played Per Week', y = 'Hotkey Groups Selected per Minute')  + geom_smooth(method = 'lm', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + scale_y_continuous(breaks = c(seq(0,200, 25))) + scale_x_continuous(breaks = c(seq(0, 100, by = 5)))

ggplot(sc, aes(HoursPerWeek, AssignToHotkeys*88.5*60)) + geom_point(color = 'purple', alpha = 0.25) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'Reported Hours Played Per Week', y = 'Hotkey Assignments Made per Minute') + geom_smooth(method = 'lm', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + scale_y_continuous(breaks = c(seq(0,7, 0.5))) + scale_x_continuous(breaks = c(seq(0, 100, 5)))

These three variables exhibit a positive correlation with HoursPerWeek, with correlation coefficients slightly larger in magnitude than those for Age.

ggplot(sc, aes(TotalHours, APM)) + geom_point(color = 'purple', alpha = 0.25) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'Reported Total Number of Hours Played', y = 'Actions Per Minute') + geom_smooth(method = 'lm', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + scale_y_continuous(breaks = c(seq(0, 400, 25))) + scale_x_continuous(breaks = c(seq(0, 4000, 250)))

ggplot(sc, aes(TotalHours, SelectByHotkeys*88.5*60)) + geom_point(color = 'purple', alpha = 0.25) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'Reported Total Number of Hours Played', y = 'Hotkey Groups Selected per Minute')  + geom_smooth(method = 'lm', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + scale_y_continuous(breaks = c(seq(0,200, 25))) +  scale_x_continuous(breaks = c(seq(0, 4000, 250)))

ggplot(sc, aes(TotalHours, AssignToHotkeys*88.5*60)) + geom_point(color = 'purple', alpha = 0.25) + theme(panel.background = element_rect(fill = 'white'), panel.grid.major =  element_line(color = 'gray')) + labs(x = 'Reported Total Number of Hours Played', y = 'Hotkey Assignments Made per Minute') + geom_smooth(method = 'lm', se = FALSE, color = 'blue', size = 1.2, aes(group = 1)) + scale_y_continuous(breaks = c(seq(0,7, 0.5))) +  scale_x_continuous(breaks = c(seq(0, 4000, 250)))

This set of scatterplots shows an even stronger positive correlation between the three variables and TotalHours. Although these findings might not be completely accurate, since the values are self-reported, I feel that the correlation is strong enough to warrant future research into the relationships among these variables.
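One way to summarize the pattern across all three sets of scatterplots is to compare the coefficients for a single metric directly. A sketch with toy data; on the real data, the columns would come from `sc`:

```r
# Correlation of APM with each time variable (toy data for illustration;
# on the real data, use sc[c('Age', 'HoursPerWeek', 'TotalHours')] and sc$APM).
toy <- data.frame(Age = c(18, 20, 25, 30, 35),
                  HoursPerWeek = c(30, 25, 15, 10, 5),
                  TotalHours = c(2000, 1500, 800, 400, 200),
                  APM = c(180, 160, 120, 100, 80))
sapply(toy[c('Age', 'HoursPerWeek', 'TotalHours')],
       function(x) cor(x, toy$APM))
```

This gives a compact numeric companion to the regression lines: one coefficient per time variable, with the expected signs.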

The following is the code used for the visualizations in the Executive Summary section.

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
workers <- ggplot(sc, aes(sc$League, sc$WorkersMade*88.5*60)) + geom_boxplot(outlier.shape = NA, fill = 'purple', alpha = 0.25) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 0.75, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.title = element_text(size = 10)) + labs(title = '1) Workers Produced per Minute')

minimapattacks <- ggplot(sc, aes(sc$League, sc$MinimapAttacks*88.5*60)) + geom_boxplot(outlier.shape = NA, fill = 'purple', alpha = 0.25) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 0.75, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.title = element_text(size = 10)) + labs(title = '2) Minimap Attack Commands Issued per Minute') + scale_y_continuous(breaks = c(seq(0, 15, 1)))

minimaprightclicks <- ggplot(sc, aes(sc$League, sc$MinimapRightClicks*88.5*60)) + geom_boxplot(outlier.shape = NA, fill = 'purple', alpha = 0.25) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 0.75, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.title = element_text(size = 10)) + labs(title = '3) Minimap Right Clicks Performed per Minute')

assignhotkeys <- ggplot(sc, aes(sc$League, sc$AssignToHotkeys*88.5*60)) + geom_boxplot(outlier.shape = NA, fill = 'purple', alpha = 0.25) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 0.75, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.title = element_text(size = 10)) + labs(title = '4) Hotkey Assignments Made per Minute')

selecthotkeys <- ggplot(sc, aes(sc$League, sc$SelectByHotkeys*88.5*60)) + geom_boxplot(outlier.shape = NA, fill = 'purple', alpha = 0.25) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 0.75, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.title = element_text(size = 10)) + labs(title = '5) Hotkey Groups Selected per Minute')

mapexplored <- ggplot(sc, aes(sc$League, sc$TotalMapExplored*88.5*60)) + geom_boxplot(outlier.shape = NA, fill = 'purple', alpha = 0.25) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 0.75, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.title = element_text(size = 10)) + labs(title = '6) Number of 24x24 Map Grids Explored per Minute')

complexability <- ggplot(sc, aes(sc$League, sc$ComplexAbilityUsed*88.5*60)) + geom_boxplot(outlier.shape = NA, fill = 'purple', alpha = 0.25) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 0.75, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.title = element_text(size = 10)) + labs(title = '7) Targeted Abilities Used per Minute')

complexunits <- ggplot(sc, aes(sc$League, sc$ComplexUnitsMade*88.5*60)) + geom_boxplot(outlier.shape = NA, fill = 'purple', alpha = 0.25) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 0.75, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.title = element_text(size = 10)) + labs(title = '8) Advanced Units Produced per Minute')

uniquehotkeys <- ggplot(sc, aes(sc$League, sc$UniqueHotkeys*88.5*60)) + geom_boxplot(outlier.shape = NA, fill = 'purple', alpha = 0.25) + geom_smooth(method = 'loess', se = FALSE, color = 'blue', size = 0.75, aes(group = 1)) + theme(panel.background = element_rect(fill = 'white'), axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.title = element_text(size = 10)) + labs(title = '9) Unique Hotkeys Used per Minute')

grid.arrange(workers, minimapattacks, minimaprightclicks, assignhotkeys, selecthotkeys, mapexplored, complexability, complexunits, uniquehotkeys)

#Commenting out code that was preventing my notebook from knitting

#subset <- data.frame(c(sc[3:5], sc[6:8]))
#corr_subset <- cor(subset)
#corrplot(corr_subset, method = 'number', diag = FALSE, type = 'upper', col = viridis(256), title = 'Significant #Correlations Among Skills and Time Variables', mar=c(0,0,1,0))

6. Conclusion

As mentioned in the introduction and the main analysis, a major limitation of this dataset is that the time variables are self-reported by participants. Also, since each data point represents only one player’s skill level at a single point in time (one game), there is no way to observe that player’s skill over time. By default, SC2 saves a dated replay of every game played to a local folder, and even when the game is uninstalled, this folder is not automatically removed from the computer. These problems could therefore be addressed by collecting a second, more comprehensive dataset in which each participant submits the entire contents of their replay folder (or as many of the replays as possible). With such a dataset, player performance could be tracked across a longer period of time. In addition, TotalHours could be calculated directly by summing the lengths of every game in the replay folder, and HoursPerWeek could then be calculated by binning the games by calendar week, based on the dates in the file names, and averaging across the total number of weeks spanned by each player’s replay collection.
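The proposed calculation can be sketched as follows. Everything here is hypothetical: the dates and game lengths below are invented, and in practice they would be parsed from the replay file names and contents.

```r
# Hypothetical replay collection: dates would come from file names,
# lengths from the replay contents (both invented here for illustration).
replays <- data.frame(
  date = as.Date(c('2013-01-01', '2013-01-03', '2013-01-10', '2013-01-12')),
  length_hours = c(0.4, 0.6, 0.5, 0.5)
)

# TotalHours: sum of all game lengths
total_hours <- sum(replays$length_hours)

# HoursPerWeek: bin games into 7-day windows from the first game, then
# average over the number of windows spanned (calendar-week binning via
# format(date, '%Y-%U') would be a close alternative).
replays$week <- as.integer((replays$date - min(replays$date)) / 7)
n_weeks <- max(replays$week) + 1
hours_per_week <- total_hours / n_weeks

c(TotalHours = total_hours, HoursPerWeek = hours_per_week)
```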

The main lesson I learned from analyzing this dataset is to be skeptical of the variable names and descriptions provided by whoever collected the data, as these descriptions can be vague or even incorrect. Domain knowledge is valuable for judging whether one’s interpretation of a vaguely described variable makes sense. Without domain knowledge, it helps to transform the variables into a more familiar form (in this case, converting rate variables from per-timestamp values to per-minute values). This transformation, combined with some research into the context and origin of the dataset, can help one decide whether the resulting values are sensible given the proposed definition of the variable.